Lab 6 CNN Lawrence Lim and
Matthew Grover
Preparation:
https://www.kaggle.com/datasets/shaunthesheep/microsoft-catsvsdogs-dataset?resource=download
Evaluation
Metric:
The data we used is the Cats vs. Dogs dataset provided on Kaggle, which offers images of dogs and cats as provided by Microsoft Research. Thus, the problem I am trying to solve is no longer medical in nature, like Lab 2, but instead applies to a situation where neither false positives nor false negatives are especially critical. Therefore, a metric like recall, which emphasizes finding all the positive instances even at the cost of some false positives, might not be an applicable option. Likewise, a metric like precision, which suits applications such as fraud detection where false positives carry real consequences, is unnecessary. To further narrow down the options, it is essential that I evaluate the dataset's distribution and class imbalance.

In the code below, I analyze the distribution of features by using a MobileNetV2 model to extract features and then applying PCA and t-SNE to reduce the dimensionality so we can see the results in graph form. It is important to note that I initially faced issues because the image processing was too computationally intensive for Deepnote to handle with RAM limited to 5 GB. Thus, I loaded and processed the images in batches. In addition, I resized the images to 112×112 to reduce the amount of resources required.
Finally, I switched away from ResNet50: although its 50 layers and residual connections address the vanishing-gradient problem, it requires a lot of computational power and was not worth coordinating around, given that this is a group project. MobileNetV2 uses depthwise separable convolutions, which make it much less intensive, at the risk of being unable to handle complex image-recognition tasks. Considering that we are simply trying to differentiate dogs and cats, I do not envision this being a problem.
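To make the efficiency difference concrete, here is a back-of-the-envelope parameter count for a single layer (the 3×3 kernel and the channel counts are illustrative choices, not MobileNetV2's actual configuration):

```python
def standard_conv_params(k, c_in, c_out):
    """Parameters in a standard k x k convolution (ignoring biases)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Depthwise k x k conv (one filter per input channel)
    followed by a 1 x 1 pointwise conv."""
    return k * k * c_in + c_in * c_out

# Example: 3x3 kernels, 32 input channels, 64 output channels
std = standard_conv_params(3, 32, 64)        # 18432
sep = depthwise_separable_params(3, 32, 64)  # 288 + 2048 = 2336
print(std, sep, round(std / sep, 1))         # roughly 7.9x fewer parameters
```

The same factor-of-several reduction repeats at every layer, which is why the swap matters on a RAM- and time-constrained notebook environment.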
The results below show that PCA achieves better separation than t-SNE. This could be because the images have a simple structure, so PCA can effectively capture the variation in the images using a linear approach in which images are vectorized on a pixel-by-pixel basis. It is also worth noting that PCA has an exact closed-form solution (via eigendecomposition of the covariance matrix) and so always finds the optimal projection, while t-SNE minimizes a non-convex objective and can converge to a poor local minimum, which can decrease performance. Furthermore, we see that there appear to be more blue dots than orange ones, indicating that there are more cat images than dog images. This is something I confirm in the following code block.
import os
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2, preprocess_input
from tensorflow.keras.preprocessing import image

# Function to load and process images from a folder in batches
def load_and_process_images(folder, batch_size=32):
    filenames = os.listdir(folder)
    features = []
    model = MobileNetV2(weights='imagenet', include_top=False, pooling='avg', input_shape=(112, 112, 3))
    for i in range(0, len(filenames), batch_size):
        batch_images = []
        for filename in filenames[i:i+batch_size]:
            img_path = os.path.join(folder, filename)
            img = image.load_img(img_path, target_size=(112, 112))
            img_array = image.img_to_array(img)
            batch_images.append(img_array)
        batch_images_preprocessed = preprocess_input(np.array(batch_images))
        batch_features = model.predict(batch_images_preprocessed)
        features.extend(batch_features)
    return features

# Load and process images in batches
cats = load_and_process_images('Cat')
dogs = load_and_process_images('Dog')
all_features = np.array(cats + dogs)  # Convert to NumPy array

# Perform PCA and t-SNE
pca = PCA(n_components=2)
features_pca = pca.fit_transform(all_features)
tsne = TSNE(n_components=2, perplexity=30, n_iter=1000)
features_tsne = tsne.fit_transform(all_features)

# Visualize features using PCA and t-SNE
def plot_features_2d(features, title):
    plt.scatter(features[:len(cats), 0], features[:len(cats), 1], label='Cat', alpha=0.5)
    plt.scatter(features[len(cats):, 0], features[len(cats):, 1], label='Dog', alpha=0.5)
    plt.legend()
    plt.title(title)
    plt.show()

plot_features_2d(features_pca, 'PCA')
plot_features_2d(features_tsne, 't-SNE')
The fact that there is a significant class imbalance means that precision and recall on their own are not good options. Accuracy is also ruled out, as it can be misleading: in a highly imbalanced dataset, accuracy can be high even if the model performs poorly on the underrepresented minority class(es). One good metric option is the Area Under the Receiver Operating Characteristic Curve (AUC-ROC). This metric measures the model's ability to distinguish between the positive and negative classes across different decision thresholds and is less sensitive to imbalanced datasets. The F1 score would also be a good option, as it seeks a balance between precision and recall. An option like balanced accuracy, the average of sensitivity (true positive rate) and specificity (true negative rate), though resilient, is not a good choice considering we have a very high level of imbalance.
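As a minimal sketch of how these two metrics are computed with scikit-learn (the labels and scores below are hypothetical, not from our dataset):

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

# Hypothetical imbalanced ground truth: 8 negatives, 2 positives
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
# Hypothetical predicted probabilities from a classifier
y_prob = np.array([0.1, 0.2, 0.15, 0.3, 0.25, 0.4, 0.35, 0.6, 0.7, 0.8])

# F1 needs hard labels, so threshold the scores at 0.5
y_pred = (y_prob >= 0.5).astype(int)
print(f1_score(y_true, y_pred))      # 0.8: one false positive, no false negatives
# AUC-ROC works on the raw scores across all thresholds
print(roc_auc_score(y_true, y_prob)) # 1.0: every positive outscores every negative
```

Note the difference in inputs: F1 depends on a chosen decision threshold, while AUC-ROC summarizes ranking quality over all thresholds.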
import os

# Function to count the number of images in a folder
def count_images(folder):
    return len(os.listdir(folder))

# Count cat and dog images
num_cats = count_images('Cat')
num_dogs = count_images('Dog')
print(f'Number of cat images: {num_cats}')
print(f'Number of dog images: {num_dogs}')
The Jaccard similarity value for PCA suggests that, when using PCA for dimensionality reduction, the resulting 2D features provide a moderate level of separation between cat and dog images. The K-means algorithm was able to group the images into clusters that are approximately 54.85% similar to the true labels. This means that the PCA representation has some ability to distinguish between cats and dogs and is superior to the t-SNE performance of less than 1%. Given the moderate separation, the F1 score and AUC-ROC remain good evaluation metrics to use.
from sklearn.metrics import jaccard_score
from sklearn.cluster import KMeans

def compute_jaccard_similarity(features, cat_labels, dog_labels):
    kmeans = KMeans(n_clusters=2, random_state=42)
    kmeans.fit(features)
    cat_cluster = kmeans.predict(features[:len(cats)])
    dog_cluster = kmeans.predict(features[len(cats):])
    jaccard_similarity = jaccard_score(np.concatenate((cat_labels, dog_labels)),
                                       np.concatenate((cat_cluster, dog_cluster)))
    return jaccard_similarity

cat_labels = np.zeros(len(cats), dtype=int)
dog_labels = np.ones(len(dogs), dtype=int)
jaccard_pca = compute_jaccard_similarity(features_pca, cat_labels, dog_labels)
jaccard_tsne = compute_jaccard_similarity(features_tsne, cat_labels, dog_labels)
print(f'Jaccard similarity (PCA): {jaccard_pca:.4f}')
print(f'Jaccard similarity (t-SNE): {jaccard_tsne:.4f}')
Number of cat images: 7117
Number of dog images: 2835
Method For Dividing Training/Testing:
Given the significant class imbalance, simple random splitting would be a very poor option: it could introduce bias, since each set might not contain enough of a particular class. A fixed train-test split would also be a poor option because we read in the images sequentially (cats followed by dogs). Thus, it is possible that the training data would contain almost all, if not all, of the cats, with the testing data containing the remaining dogs and few or no cats. The method we should go with is stratified k-fold cross-validation, as it works well with medium-sized datasets and helps account for class imbalance by ensuring that each fold retains the same class distribution as the original dataset. By repeating the process multiple times with different folds for training and testing, stratified k-fold cross-validation also provides a more robust estimate of the model's performance than a single train-test split. Finally, by using cross-validation to evaluate the model's performance for different hyperparameter settings, we can identify the settings that generalize best to new data. This can help improve the model's performance in practice by selecting hyperparameter settings that are less likely to overfit the training data.
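The stratification guarantee can be illustrated on toy labels (the 71/29 split below is illustrative of our imbalance direction, not our exact counts):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: roughly 71% class 0, 29% class 1
y = np.array([0] * 71 + [1] * 29)
X = np.zeros((len(y), 1))  # placeholder features

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y), start=1):
    # Each validation fold keeps roughly the original 71/29 class ratio
    print(f"Fold {fold}: positive fraction in validation = {y[val_idx].mean():.2f}")
```

A plain `KFold` on sequentially ordered data would give wildly varying fractions here; `StratifiedKFold` pins each fold to approximately the global class ratio.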
Jaccard similarity (PCA): 0.5500
Jaccard similarity (t-SNE): 0.0087
Modeling:
Set Up Training:
The first step in choosing data augmentation techniques is looking at the dog and cat images to see commonalities and differences. Below, I select 12 random cat images and 12 random dog images. I have run the code block a few times to look at different images. One commonality across all the images is that the animal almost always appears upright rather than sideways or upside down. Therefore, rotation is not a necessary augmentation technique to implement. However, we do see that the angle at which an image is taken varies. Thus, I will use flipping to mirror the images during training, in an attempt to better prepare the model for images taken from different angles. Furthermore, shearing, or distorting an image along an axis, is also a good method of preparing the model for various angles. Translation will also be used to ensure that the model can account for the rare instance where the dog or cat is not centered in the frame.

One interesting situation I will have to account for is the presence of additional objects along with the animal; one common example is people holding their pets. One way to account for this is random cropping. In this augmentation method, we select the top-left coordinate (x, y) of the crop and then extract a fixed-size sub-image from the original image. It is important to note that random cropping does not inherently know how to focus on the dog or cat and eliminate the background person holding the animal. Instead, it randomly selects different regions within the image as crops. Over time and with multiple crops, the model is exposed to various parts of the image, including both the dog or cat and background elements like the person holding the animal. This exposure to diverse regions of the image encourages the model to learn the features that are most relevant to the target object (e.g., the dog or cat) and to become more robust to variations in the background or context. Thus, the model will "learn" that the person in the background is not relevant while the characteristics of the animal in their arms are.
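Random cropping as described can be sketched in a few lines of NumPy (a standalone illustration; the training code below instead approximates this effect with ImageDataGenerator's shift and zoom transforms):

```python
import numpy as np

def random_crop(img, crop_h, crop_w, rng=np.random.default_rng()):
    """Pick a random top-left (x, y) and extract a fixed-size sub-image."""
    h, w = img.shape[:2]
    y = rng.integers(0, h - crop_h + 1)
    x = rng.integers(0, w - crop_w + 1)
    return img[y:y + crop_h, x:x + crop_w]

# Example: crop a 96x96 patch from a 112x112 "image"
img = np.arange(112 * 112 * 3).reshape(112, 112, 3)
patch = random_crop(img, 96, 96)
print(patch.shape)  # (96, 96, 3)
```

Each call yields a different sub-region, so over many epochs the model sees the animal at varied positions and scales.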
import os
import random
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing import image

def display_random_images(folder, num_images=12):
    filenames = os.listdir(folder)
    random.shuffle(filenames)
    fig, axes = plt.subplots(3, 4, figsize=(12, 9))
    fig.suptitle(f'{folder} Images', fontsize=16)
    for i, ax in enumerate(axes.flat):
        if i < num_images:
            img_path = os.path.join(folder, filenames[i])
            img = image.load_img(img_path, target_size=(112, 112))
            ax.imshow(img)
            ax.set_xticks([])
            ax.set_yticks([])
        else:
            ax.axis('off')
    plt.show()

# Display 12 random cat images
display_random_images('Cat')
# Display 12 random dog images
display_random_images('Dog')
Note: For the code below, due to RAM constraints on Deepnote, we utilize ImageDataGenerator's batched flow method as a means of feeding images to the model and applying the augmentation on smaller batches of 32 images, which uses only about 1 GB of RAM. Furthermore, we ensure that the image size is 112×112 to reduce memory usage.
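The rough memory arithmetic behind this choice (assuming float32 pixel values and ignoring framework overhead):

```python
def batch_mb(n_images, h, w, channels=3, bytes_per_value=4):
    """Approximate memory for a batch of float32 RGB images, in MB."""
    return n_images * h * w * channels * bytes_per_value / 1024 ** 2

# One batch of 32 images at 112x112 is tiny...
print(round(batch_mb(32, 112, 112), 2))             # ~4.59 MB
# ...while holding ~10,000 such images at once costs over a gigabyte
print(round(batch_mb(10_000, 112, 112) / 1024, 2))  # ~1.4 GB
```

The raw batch itself is small; the bulk of the observed ~1 GB comes from the model, framework buffers, and intermediate activations, which is why streaming batches rather than loading everything up front is what makes the 5 GB limit workable.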
The below code produces the following graphs:
[12 figures: "Model (create_model_1 or create_model_2, num_filters=32 or 64), Fold 1-3, Accuracy and Loss": one accuracy-and-loss plot per model variation, filter size, and fold]
Note that there are 12 graphs in total, with each model variation and filter-size combination producing an accuracy and loss plot for each of the 3 folds used in the stratified k-fold cross-validation. The difference between the two models is that the first model has two convolutional layers, each followed by a max-pooling layer, while Model 2 has three convolutional layers, with the third one added between the second convolutional layer and the second max-pooling layer. The idea is that this additional convolutional layer in Model 2 can potentially help the model learn more complex features from the input images. However, this comes with the risk of overfitting the image data, leading to an increased discrepancy between the performance on the training and validation sets. Finally, as we learned the hard way, additional layers increase computation time: the code block below took well over 4 hours to complete!

Though accuracy is not our primary metric, we do see a larger discrepancy between the training and validation accuracies for Model 2 than for Model 1, whose training and validation accuracies remain quite similar. This supports our hypothesis that Model 2 is overfitting the data. Furthermore, looking at the training and validation loss graphs, Model 1 usually shows a downward trend in validation loss that approaches the training loss, while Model 2 usually ends with the validation loss not decreasing throughout the epochs, indicating overfitting. For both Model 1 and Model 2, although there is a clear improvement in accuracy during training, there is no clear improvement in validation accuracy, which even occasionally decreases for Model 2.
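The train/validation gap we point to can be read straight off Keras's history dict; here is a small sketch using hypothetical accuracy curves (not the actual values from our runs) to show the pattern:

```python
def overfit_gap(history):
    """Final-epoch gap between training and validation accuracy.
    `history` mirrors the dict Keras stores in model.fit(...).history."""
    return history["accuracy"][-1] - history["val_accuracy"][-1]

# Hypothetical curves: a model that tracks validation closely...
h1 = {"accuracy": [0.70, 0.72, 0.75], "val_accuracy": [0.69, 0.71, 0.73]}
# ...versus one that keeps improving on the training data only
h2 = {"accuracy": [0.71, 0.76, 0.82], "val_accuracy": [0.70, 0.71, 0.70]}

print(round(overfit_gap(h1), 2))  # 0.02
print(round(overfit_gap(h2), 2))  # 0.12
```

A persistently widening gap of this kind is the quantitative signature of the overfitting we describe for Model 2.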
import matplotlib.pyplot as plt
import os
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2, preprocess_input
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score, roc_auc_score
from tensorflow.keras import backend as K

# Variation Model 1
def create_model_1(num_filters=32):
    model = Sequential([
        Conv2D(num_filters, (3, 3), activation='relu', input_shape=(112, 112, 3)),
        MaxPooling2D((2, 2)),
        Conv2D(num_filters * 2, (3, 3), activation='relu'),
        MaxPooling2D((2, 2)),
        Flatten(),
        Dense(64, activation='relu'),
        Dropout(0.5),
        Dense(1, activation='sigmoid')
    ])
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# Variation Model 2
def create_model_2(num_filters=32):
    model = Sequential([
        Conv2D(num_filters, (3, 3), activation='relu', input_shape=(112, 112, 3)),
        MaxPooling2D((2, 2)),
        Conv2D(num_filters * 2, (3, 3), activation='relu'),
        Conv2D(num_filters * 4, (3, 3), activation='relu'),
        MaxPooling2D((2, 2)),
        Flatten(),
        Dense(64, activation='relu'),
        Dropout(0.5),
        Dense(1, activation='sigmoid')
    ])
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

from PIL import Image

image_size = (112, 112)

def load_and_resize_image(image_path, target_size):
    img = Image.open(image_path)
    img = img.convert('RGB')  # Convert the image to RGB format
    img = img.resize(target_size, Image.LANCZOS)  # ANTIALIAS is deprecated; LANCZOS is its replacement
    return np.array(img)

cat_folder = 'Cat'
dog_folder = 'Dog'
batch_size = 64

cat_image_paths = [os.path.join(cat_folder, img) for img in os.listdir(cat_folder)]
dog_image_paths = [os.path.join(dog_folder, img) for img in os.listdir(dog_folder)]
cats = [load_and_resize_image(p, image_size) for p in cat_image_paths]
dogs = [load_and_resize_image(p, image_size) for p in dog_image_paths]

X = np.concatenate((np.array(cats), np.array(dogs)), axis=0)
y = np.array([0] * len(cats) + [1] * len(dogs))

# Stratified k-fold cross-validation
skf = StratifiedKFold(n_splits=3)
num_epochs = 10

model_creators = [create_model_1, create_model_2]
filter_sizes = [32, 64]
f1_scores = {f"{mc.__name__}_num_filters={nf}": [] for mc in model_creators for nf in filter_sizes}
auc_roc_scores = {f"{mc.__name__}_num_filters={nf}": [] for mc in model_creators for nf in filter_sizes}

for train_index, val_index in skf.split(X, y):
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]

    # Data augmentation generator
    train_datagen = ImageDataGenerator(
        preprocessing_function=preprocess_input,
        horizontal_flip=True,
        shear_range=0.2,
        width_shift_range=0.2,
        height_shift_range=0.2,
        fill_mode='reflect',
        zoom_range=0.2
    )
    train_datagen.fit(X_train)

    # Train and evaluate each model
    for model_creator in model_creators:
        for num_filters in filter_sizes:
            model = model_creator(num_filters)
            history = model.fit(
                train_datagen.flow(X_train, y_train, batch_size=batch_size),
                epochs=num_epochs,
                steps_per_epoch=len(X_train) // batch_size,
                validation_data=(X_val, y_val)
            )

            # Evaluate the model using F1 score and AUC-ROC
            y_val_pred = model.predict(X_val)
            f1 = f1_score(y_val, y_val_pred.round())
            auc_roc = roc_auc_score(y_val, y_val_pred)
            print(f'Model ({model_creator.__name__}, num_filters={num_filters}): F1 Score = {f1}, AUC-ROC = {auc_roc}')
            f1_scores[f"{model_creator.__name__}_num_filters={num_filters}"].append(f1)
            auc_roc_scores[f"{model_creator.__name__}_num_filters={num_filters}"].append(auc_roc)

            # Visualize the performance of the training and validation sets per iteration
            plt.figure(figsize=(12, 4))
            plt.subplot(1, 2, 1)
            plt.plot(history.history['accuracy'], label='Training')
            plt.plot(history.history['val_accuracy'], label='Validation')
            plt.xlabel('Epoch')
            plt.ylabel('Accuracy')
            plt.legend()
            plt.subplot(1, 2, 2)
            plt.plot(history.history['loss'], label='Training')
            plt.plot(history.history['val_loss'], label='Validation')
            plt.xlabel('Epoch')
            plt.ylabel('Loss')
            plt.legend()
            plt.suptitle(f'Model ({model_creator.__name__}, num_filters={num_filters})')
            plt.show()
            K.clear_session()
Fold 1 results:
Model (create_model_1, num_filters=32): F1 Score = 0.5406, AUC-ROC = 0.7284
Model (create_model_1, num_filters=64): F1 Score = 0.4106, AUC-ROC = 0.7089
Model (create_model_2, num_filters=32): F1 Score = 0.5674, AUC-ROC = 0.7426
Model (create_model_2, num_filters=64): F1 Score = 0.3968, AUC-ROC = 0.6525
Fold 2 results:
Model (create_model_1, num_filters=32): F1 Score = 0.5446, AUC-ROC = 0.7240
Model (create_model_1, num_filters=64): F1 Score = 0.4505, AUC-ROC = 0.7224
Model (create_model_2, num_filters=32): F1 Score = 0.2346, AUC-ROC = 0.5956
Model (create_model_2, num_filters=64): F1 Score = 0.4488, AUC-ROC = 0.6862
Fold 3 results:
Model (create_model_1, num_filters=32): F1 Score = 0.0000, AUC-ROC = 0.6854
Model (create_model_1, num_filters=64): F1 Score = 0.5238, AUC-ROC = 0.6859
(per-epoch training logs omitted; the remaining fold 3 output is truncated)
103/103 [==============================] - 52s 500ms/step - loss: 0.5062 - accuracy: 0.7580 - val_loss:
Epoch 8/10
103/103 [==============================] - 51s 490ms/step - loss: 0.4982 - accuracy: 0.7641 - val_loss:
Epoch 9/10
103/103 [==============================] - 50s 483ms/step - loss: 0.4855 - accuracy: 0.7775 - val_loss:
Epoch 10/10
103/103 [==============================] - 50s 489ms/step - loss: 0.4835 - accuracy: 0.7790 - val_loss:
104/104 [==============================] - 5s 47ms/step
Model (create_model_2, num_filters=32): F1 Score = 0.5710316081972906, AUC-ROC = 0.73738122005407
Epoch 1/10
103/103 [==============================] - 171s 2s/step - loss: 0.6260 - accuracy: 0.7055 - val_loss: 8
Epoch 2/10
103/103 [==============================] - 170s 2s/step - loss: 0.5718 - accuracy: 0.7255 - val_loss: 1
Epoch 3/10
103/103 [==============================] - 171s 2s/step - loss: 0.5545 - accuracy: 0.7373 - val_loss: 2
Epoch 4/10
103/103 [==============================] - 172s 2s/step - loss: 0.5313 - accuracy: 0.7419 - val_loss: 3
Epoch 5/10
103/103 [==============================] - 170s 2s/step - loss: 0.5258 - accuracy: 0.7497 - val_loss: 5
Epoch 6/10
103/103 [==============================] - 171s 2s/step - loss: 0.5174 - accuracy: 0.7480 - val_loss: 3
Epoch 7/10
103/103 [==============================] - 169s 2s/step - loss: 0.5015 - accuracy: 0.7562 - val_loss: 5
Epoch 8/10
103/103 [==============================] - 170s 2s/step - loss: 0.4939 - accuracy: 0.7626 - val_loss: 3
Epoch 9/10
103/103 [==============================] - 169s 2s/step - loss: 0.4789 - accuracy: 0.7658 - val_loss: 8
Epoch 10/10
103/103 [==============================] - 168s 2s/step - loss: 0.4680 - accuracy: 0.7784 - val_loss: 4
104/104 [==============================] - 18s 168ms/step
Model (create_model_2, num_filters=64): F1 Score = 0.4857142857142857, AUC-ROC = 0.627414411520651
Below, we summarize the average AUC-ROC and F1 scores for each cross-validation run, broken down by model type and number of filters. There are two sets of Final Results printed because the preceding code block was run twice. The results show that adding additional filters, with the idea that they would learn/extract more complex features from the image data, did not consistently make the models any better at distinguishing between the positive and negative classes, as the AUC-ROC score remained essentially constant. This could indicate that, given the simplicity of the task, a binary classification of two common animals, the ability to extract more complex features and differences has no real benefit. Overall, the fact that the F1 scores remained below 0.5 suggests that the models did a below-average job in terms of precision and recall, while AUC-ROC scores around 0.6 suggest they ranked the two classes only modestly better than random guessing. The discrepancy between the two statistics suggests that the models either failed to identify many of the actual positives (low recall) and/or made few correct positive predictions out of all the positive predictions they made (low precision).
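The gap between the two metrics is easy to reproduce on toy numbers. The probabilities below are made up purely for illustration: AUC-ROC measures ranking quality across all thresholds, while F1 collapses to 0 whenever no predicted probability crosses the default 0.5 cutoff.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

# Hypothetical scores: positives are ranked above most negatives (decent
# AUC-ROC), but no probability exceeds 0.5, so rounding predicts all-negative.
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
y_prob = np.array([0.10, 0.15, 0.20, 0.25, 0.48,
                   0.30, 0.40, 0.45, 0.47, 0.49])

auc = roc_auc_score(y_true, y_prob)    # threshold-free ranking quality: 0.84
f1 = f1_score(y_true, y_prob.round())  # threshold-dependent: 0.0 here
```

A model can therefore look respectable by AUC-ROC while producing an F1 of 0, exactly the pattern seen in some of the folds above.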
print("\nFinal Results:")
for model_name in f1_scores:
    mean_f1 = np.mean(f1_scores[model_name])
    std_f1 = np.std(f1_scores[model_name])
    mean_auc_roc = np.mean(auc_roc_scores[model_name])
    std_auc_roc = np.std(auc_roc_scores[model_name])
    print(f"{model_name}: Mean F1 Score = {mean_f1:.4f} (std = {std_f1:.4f}), Mean AUC-ROC = {mean_auc_roc:.4f} (std = {std_auc_roc:.4f})")
Final Results:
create_model_1_num_filters=32: Mean F1 Score = 0.3853 (std = 0.1423), Mean AUC-ROC = 0.6831 (std = 0.03
create_model_1_num_filters=64: Mean F1 Score = 0.3804 (std = 0.1334), Mean AUC-ROC = 0.6421 (std = 0.02
create_model_2_num_filters=32: Mean F1 Score = 0.4079 (std = 0.1968), Mean AUC-ROC = 0.6558 (std = 0.07
create_model_2_num_filters=64: Mean F1 Score = 0.4059 (std = 0.0769), Mean AUC-ROC = 0.6515 (std = 0.02
Final Results:
create_model_1_num_filters=32: Mean F1 Score = 0.3617 (std = 0.2558), Mean AUC-ROC = 0.7126 (std = 0.01
The results below show the distribution of F1 and AUC-ROC scores for the different models. We see that increasing the number of filters reduces the standard deviation of the model's scores. Interestingly, it appears that the extra filters make the model more consistent at the expense of actual prediction performance. One explanation for this is that the additional filters do not capture new, important insights, but rather irrelevant noise that makes the model perform worse on the validation data. This idea is supported by the final validation losses in the graphs above, where we see that, more often than not, the validation loss for 32 filters tracks the training loss more closely than with 64 filters, along with the corresponding difference in accuracy.
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
# Box plots for F1 scores
f1_scores_df = pd.DataFrame.from_dict(f1_scores, orient='index').transpose()
plt.figure(figsize=(12, 6))
sns.boxplot(data=f1_scores_df)
plt.title("Box Plots of F1 Scores for Different Model Variations")
plt.xlabel("Model Variation")
plt.ylabel("F1 Score")
plt.xticks(rotation=45)
plt.show()
# Box plots for AUC-ROC scores
auc_roc_scores_df = pd.DataFrame.from_dict(auc_roc_scores, orient='index').transpose()
plt.figure(figsize=(12, 6))
sns.boxplot(data=auc_roc_scores_df)
plt.title("Box Plots of AUC-ROC Scores for Different Model Variations")
plt.xlabel("Model Variation")
plt.ylabel("AUC-ROC Score")
plt.xticks(rotation=45)
plt.show()
create_model_1_num_filters=64: Mean F1 Score = 0.4616 (std = 0.0469), Mean AUC-ROC = 0.7057 (std = 0.01
create_model_2_num_filters=32: Mean F1 Score = 0.4577 (std = 0.1577), Mean AUC-ROC = 0.6919 (std = 0.06
create_model_2_num_filters=64: Mean F1 Score = 0.4438 (std = 0.0365), Mean AUC-ROC = 0.6554 (std = 0.02
In the code below, we try to determine whether or not there is a statistically significant difference in the average AUC-ROC and F1 scores across the models at the 95% confidence level (alpha = 0.05), for the two times we ran the code above. In all instances, the p-value is greater than 0.05, suggesting there is not a significant difference. And since the code below performs a two-tailed t-test, an absolute t-statistic of roughly 1.96 or above (the large-sample cutoff) would be required to suggest a significant difference in the means across the different model and filter comparisons.
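As a quick sanity check on the 1.96 cutoff quoted above, the critical values can be computed directly. This is a sketch: the fold counts are illustrative, assuming two groups of 3 cross-validation scores each.

```python
from scipy import stats

alpha = 0.05

# Large-sample (normal) two-tailed critical value: roughly 1.96.
z_crit = stats.norm.ppf(1 - alpha / 2)

# With small samples the t-distribution cutoff is noticeably larger; for an
# independent two-sample test on two groups of 3 scores, df = 3 + 3 - 2 = 4.
t_crit = stats.t.ppf(1 - alpha / 2, df=4)
```

So with only a handful of folds per group, the true two-tailed cutoff is closer to 2.78 than to 1.96, which makes the non-significant results below even less surprising.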
from itertools import combinations
from scipy import stats

alpha = 0.05

# Function to perform pairwise t-tests and return the results
def pairwise_t_tests(scores_dict, alpha):
    significant_diff = []
    not_significant_diff = []
    # Iterate through all possible model pairs
    for model1, model2 in combinations(scores_dict.keys(), 2):
        scores1 = scores_dict[model1]
        scores2 = scores_dict[model2]
        # Perform t-test
        t_test_result = stats.ttest_ind(scores1, scores2)
        # Check for statistical significance
        if t_test_result.pvalue < alpha:
            significant_diff.append((model1, model2, t_test_result))
        else:
            not_significant_diff.append((model1, model2, t_test_result))
    return significant_diff, not_significant_diff

# Perform pairwise t-tests for F1 scores and AUC-ROC scores
f1_significant_diff, f1_not_significant_diff = pairwise_t_tests(f1_scores, alpha)
auc_roc_significant_diff, auc_roc_not_significant_diff = pairwise_t_tests(auc_roc_scores, alpha)

# Print results
print("F1 Score Comparisons with Statistically Significant Differences:")
for model1, model2, result in f1_significant_diff:
    print(f"{model1} vs {model2}: t-statistic = {result.statistic:.2f}, p-value = {result.pvalue:.4f}")

print("\nF1 Score Comparisons with No Statistically Significant Differences:")
for model1, model2, result in f1_not_significant_diff:
    print(f"{model1} vs {model2}: t-statistic = {result.statistic:.2f}, p-value = {result.pvalue:.4f}")

print("\nAUC-ROC Score Comparisons with Statistically Significant Differences:")
for model1, model2, result in auc_roc_significant_diff:
    print(f"{model1} vs {model2}: t-statistic = {result.statistic:.2f}, p-value = {result.pvalue:.4f}")

print("\nAUC-ROC Score Comparisons with No Statistically Significant Differences:")
for model1, model2, result in auc_roc_not_significant_diff:
    print(f"{model1} vs {model2}: t-statistic = {result.statistic:.2f}, p-value = {result.pvalue:.4f}")
create_model_1_num_filters=64 vs create_model_2_num_filters=64: t-statistic = -0.23, p-value = 0.8265
create_model_2_num_filters=32 vs create_model_2_num_filters=64: t-statistic = 0.01, p-value = 0.9899
AUC-ROC Score Comparisons with Statistically Significant Differences:
AUC-ROC Score Comparisons with No Statistically Significant Differences:
create_model_1_num_filters=32 vs create_model_1_num_filters=64: t-statistic = 1.41, p-value = 0.2304
create_model_1_num_filters=32 vs create_model_2_num_filters=32: t-statistic = 0.46, p-value = 0.6680
create_model_1_num_filters=32 vs create_model_2_num_filters=64: t-statistic = 1.10, p-value = 0.3330
create_model_1_num_filters=64 vs create_model_2_num_filters=32: t-statistic = -0.25, p-value = 0.8184
create_model_1_num_filters=64 vs create_model_2_num_filters=64: t-statistic = -0.43, p-value = 0.6873
create_model_2_num_filters=32 vs create_model_2_num_filters=64: t-statistic = 0.08, p-value = 0.9421
F1 Score Comparisons with Statistically Significant Differences:
F1 Score Comparisons with No Statistically Significant Differences:
Comparing the Model Above with MLP

Below, we utilize a multi-layer perceptron that consists of three dense layers: the first two have 512 and 256 neurons with the ReLU activation function, respectively, and the last layer has a single neuron with the sigmoid activation function. Furthermore, a dropout of 0.5 is applied after each hidden layer in an attempt to prevent overfitting. The model is compiled with the binary cross-entropy loss function, the Adam optimizer, and AUC-ROC and F1 score as the evaluation metrics. The output of the code below is an AUC and F1 score corresponding to each fold of the 6 cross-validation folds.

The results show that the MLP's performance is far worse than the CNN's, as the F1 score is 0 for 5 out of 6 of the results. This means that the MLP was almost completely incapable of producing any true positive predictions. Furthermore, with AUC scores closer to 0.5, we see that the MLP performs only a little better than randomly guessing dog/cat for each image.

The fact that the MLP performs worse than the CNN is expected. A CNN can learn relevant features from raw input data, such as edges, corners, and textures, by sliding filters over the input image and extracting local patterns, so the spatial relationships between pixels are preserved. This contrasts with the MLP model, which treats each pixel of the image as its own separate feature. Furthermore, a CNN incorporates pooling layers, which downsample the feature maps from one layer before moving on to the next. The benefit of this approach is that the model is resilient to variations in the position of a particular feature/element in the image, as every image taken is not identical in terms of position, size, etc.
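The weight-sharing point above can be made concrete with a back-of-the-envelope parameter count. The layer sizes here are chosen to mirror the models in this lab (a Dense(512) first layer versus a 32-filter 3x3 convolution on the same 112x112x3 input); the arithmetic is illustrative, not a summary of the actual compiled models.

```python
# Input dimensions used throughout this lab.
h, w, c = 112, 112, 3

# MLP: Dense(512) on the flattened image gives one weight per
# pixel-channel per unit, plus one bias per unit.
dense_params = (h * w * c) * 512 + 512   # 19,268,096

# CNN: Conv2D(32, (3, 3)) shares each 3x3xc kernel across every
# spatial position, plus one bias per filter.
conv_params = (3 * 3 * c) * 32 + 32      # 896

print(dense_params, conv_params)
```

The convolutional layer covers the whole image with four orders of magnitude fewer parameters, which is a large part of why the MLP both overfits more easily and ignores local spatial structure.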
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten
import pandas as pd
import scipy.stats as stats

def create_mlp_model():
    model = Sequential([
        Flatten(input_shape=(112, 112, 3)),
        Dense(512, activation='relu'),
        Dropout(0.5),
        Dense(256, activation='relu'),
        Dropout(0.5),
        Dense(1, activation='sigmoid')
    ])
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

mlp_f1_scores = []
mlp_auc_roc_scores = []
AUC-ROC Score Comparisons with Statistically Significant Differences:
AUC-ROC Score Comparisons with No Statistically Significant Differences:
create_model_1_num_filters=32 vs create_model_1_num_filters=64: t-statistic = 0.40, p-value = 0.7116
create_model_1_num_filters=32 vs create_model_2_num_filters=32: t-statistic = 0.41, p-value = 0.6997
create_model_1_num_filters=32 vs create_model_2_num_filters=64: t-statistic = 2.62, p-value = 0.0588
create_model_1_num_filters=64 vs create_model_2_num_filters=32: t-statistic = 0.28, p-value = 0.7923
create_model_1_num_filters=64 vs create_model_2_num_filters=64: t-statistic = 2.50, p-value = 0.0664
Lawrence's Workspace / Lab_6 (Published May 3, 2023, Private)
for train_index, val_index in skf.split(X, y):
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]

    # Train and evaluate the MLP model
    mlp_model = create_mlp_model()
    history = mlp_model.fit(
        train_datagen.flow(X_train, y_train, batch_size=batch_size),
        epochs=num_epochs,
        steps_per_epoch=len(X_train) // batch_size,
        validation_data=(X_val, y_val)
    )

    # Evaluate the model using F1 score and AUC-ROC
    y_val_pred = mlp_model.predict(X_val)
    f1 = f1_score(y_val, y_val_pred.round())
    auc_roc = roc_auc_score(y_val, y_val_pred)
    print(f'MLP Model: F1 Score = {f1}, AUC-ROC = {auc_roc}')

    mlp_f1_scores.append(f1)
    mlp_auc_roc_scores.append(auc_roc)
# Flatten the list of lists for each metric
cnn_f1_scores = [score for model_scores in f1_scores.values() for score in model_scores]
cnn_auc_roc_scores = [score for model_scores in auc_roc_scores.values() for score in model_scores]
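The nested comprehension above reads left-to-right like the equivalent for-loops; a tiny stand-in dict (made-up values, same shape as the scores dicts in this lab) shows the pattern:

```python
# Hypothetical scores dict in the same shape as f1_scores above.
scores = {"model_a": [0.1, 0.2], "model_b": [0.3]}

# Outer clause iterates over the lists, inner clause over their elements.
flat = [s for lst in scores.values() for s in lst]
print(flat)  # [0.1, 0.2, 0.3]
```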
The results below clearly show, through t-statistics > 1.96 and p-values < 0.05, that there is a statistically significant difference between both the F1 scores and the AUC-ROC scores when comparing the CNN to the MLP.
# Perform independent-samples (Welch's) t-tests for F1 scores and AUC-ROC scores
f1_t_test_result = stats.ttest_ind(cnn_f1_scores, mlp_f1_scores, equal_var=False)
auc_roc_t_test_result = stats.ttest_ind(cnn_auc_roc_scores, mlp_auc_roc_scores, equal_var=False)
print("Paired t-test results for F1 scores:", f1_t_test_result)
print("Paired t-test results for AUC-ROC scores:", auc_roc_t_test_result)
Exceptional Work: Transfer Learning with Pre-Trained Weights from MobileNetV2

The code below uses transfer learning to improve the network using MobileNetV2's pre-trained ImageNet weights. In terms of performance, both the average F1 score (.321) and the average AUC-ROC score (.631) for the transfer learning model were lower than all of the average scores of the CNN models tested in the previous sections. This suggests that using ImageNet to pre-train the network did not improve its performance.
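For reference, the `preprocess_input` used in the data generator below rescales 8-bit pixel values into [-1, 1]. A plain-NumPy stand-in makes the transformation explicit; this mirrors Keras's documented "tf"-mode scaling (x / 127.5 - 1), and the function name here is our own, not a Keras API.

```python
import numpy as np

# Stand-in for tensorflow.keras.applications.mobilenet_v2.preprocess_input,
# which maps [0, 255] pixel intensities to [-1, 1].
def mobilenet_scale(x):
    return x / 127.5 - 1.0

px = np.array([0.0, 127.5, 255.0])
scaled = mobilenet_scale(px)
print(scaled)  # [-1.  0.  1.]
```

Feeding pixels on the wrong scale (e.g. raw [0, 255]) to a pre-trained backbone is a common reason transfer learning underperforms, which is why the generator wires in `preprocess_input` explicitly.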
103/103 [==============================] - 16s 153ms/step - loss: 0.6135 - accuracy: 0.7040 - val_loss
Epoch 8/10
103/103 [==============================] - 16s 155ms/step - loss: 0.6116 - accuracy: 0.7105 - val_loss
Epoch 9/10
103/103 [==============================] - 16s 159ms/step - loss: 0.6036 - accuracy: 0.7098 - val_loss
Epoch 10/10
Paired t-test results for F1 scores: Ttest_indResult(statistic=9.076765496037304, pvalue=1.928857504939
Paired t-test results for AUC-ROC scores: Ttest_indResult(statistic=4.32366716576623, pvalue=0.00681802
Paired t-test results for F1 scores: Ttest_indResult(statistic=8.223723726566917, pvalue=1.883027979974
Paired t-test results for AUC-ROC scores: Ttest_indResult(statistic=5.105043376465456, pvalue=0.0097403
import numpy as np
import os
from PIL import Image
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense, Dropout
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2, preprocess_input
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score, roc_auc_score

def load_and_resize_image(image_path, target_size):
    img = Image.open(image_path)
    # Convert the image to RGB format
    img = img.convert('RGB')
    img = img.resize(target_size, Image.ANTIALIAS)
    return np.array(img)
def create_transfer_learning_model():
    base_model = MobileNetV2(weights='imagenet', include_top=False, input_shape=(112, 112, 3))
    model = Sequential([
        base_model,
        Flatten(),
        Dense(64, activation='relu'),
        Dropout(0.5),
        Dense(1, activation='sigmoid')
    ])
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
# Load data
image_size = (112, 112)
cat_folder = 'Cat'
dog_folder = 'Dog'
cat_image_paths = [os.path.join(cat_folder, img) for img in os.listdir(cat_folder)]
dog_image_paths = [os.path.join(dog_folder, img) for img in os.listdir(dog_folder)]
cats = [load_and_resize_image(cat_image_path, image_size) for cat_image_path in cat_image_paths]
dogs = [load_and_resize_image(dog_image_path, image_size) for dog_image_path in dog_image_paths]
X = np.concatenate((np.array(cats), np.array(dogs)), axis=0)
y = np.array([0] * len(cats) + [1] * len(dogs))
# Data split and augmentation
skf = StratifiedKFold(n_splits=3)
num_epochs = 3
batch_size = 64
# Evaluation metrics for transfer learning model
transfer_f1_scores = []
transfer_auc_roc_scores = []
for train_index, val_index in skf.split(X, y):
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]

    # Data augmentation generator
    train_datagen = ImageDataGenerator(
        preprocessing_function=preprocess_input,
        horizontal_flip=True,
        shear_range=0.2,
        width_shift_range=0.2,
        height_shift_range=0.2,
        fill_mode='reflect',
        zoom_range=0.2
    )
    train_datagen.fit(X_train)

    # Train and evaluate the transfer learning model
    transfer_model = create_transfer_learning_model()
    history = transfer_model.fit(
        train_datagen.flow(X_train, y_train, batch_size=batch_size),
        epochs=num_epochs,
        steps_per_epoch=len(X_train) // batch_size,
        validation_data=(X_val, y_val)
    )

    # Evaluate the model using F1 score and AUC-ROC
    y_val_pred = transfer_model.predict(X_val)
    f1 = f1_score(y_val, y_val_pred.round())
    auc_roc = roc_auc_score(y_val, y_val_pred)
    print(f'Transfer Learning Model: F1 Score = {f1}, AUC-ROC = {auc_roc}')

    transfer_f1_scores.append(f1)
    transfer_auc_roc_scores.append(auc_roc)
# Calculate mean and standard deviation of F1 scores and AUC-ROC scores
mean_transfer_f1_score = np.mean(transfer_f1_scores)
std_transfer_f1_score = np.std(transfer_f1_scores)
mean_transfer_auc_roc_score = np.mean(transfer_auc_roc_scores)
std_transfer_auc_roc_score = np.std(transfer_auc_roc_scores)

print("Transfer Learning Model Performance:")
print(f"Mean F1 Score: {mean_transfer_f1_score:.4f}, Standard Deviation: {std_transfer_f1_score:.4f}")
print(f"Mean AUC-ROC Score: {mean_transfer_auc_roc_score:.4f}, Standard Deviation: {std_transfer_auc_roc_score:.4f}")
warnings.warn(str(msg))
WARNING:tensorflow:`input_shape` is undefined or non-square, or `rows` is not in [96, 128, 160, 192, 22
Epoch 1/3
103/103 [==============================] - 68s 631ms/step - loss: 0.3540 - accuracy: 0.8907 - val_loss
Epoch 2/3
103/103 [==============================] - 64s 624ms/step - loss: 0.1891 - accuracy: 0.9323 - val_loss
Epoch 3/3
103/103 [==============================] - 64s 623ms/step - loss: 0.1497 - accuracy: 0.9457 - val_loss
104/104 [==============================] - 7s 59ms/step
Transfer Learning Model: F1 Score = 0.07302231237322515, AUC-ROC = 0.7454156883992535
WARNING:tensorflow:`input_shape` is undefined or non-square, or `rows` is not in [96, 128, 160, 192, 22
Epoch 1/3
103/103 [==============================] - 68s 631ms/step - loss: 0.3726 - accuracy: 0.8770 - val_loss
Epoch 2/3
103/103 [==============================] - 64s 623ms/step - loss: 0.2136 - accuracy: 0.9262 - val_loss
Epoch 3/3
103/103 [==============================] - 64s 624ms/step - loss: 0.1674 - accuracy: 0.9393 - val_loss
104/104 [==============================] - 7s 59ms/step
Transfer Learning Model: F1 Score = 0.4469750889679716, AUC-ROC = 0.6487613426483578
WARNING:tensorflow:`input_shape` is undefined or non-square, or `rows` is not in [96, 128, 160, 192, 22
Epoch 1/3
from tensorflow.keras.layers import SeparableConv2D, BatchNormalization
from tensorflow.keras.layers import Add, Flatten, Dense
from tensorflow.keras.layers import average, concatenate, Input
from tensorflow.keras.models import Model

NUM_CLASSES = 2

# let's add a fully-connected layer
input_x = Input(shape=x_train_mnv2[0].shape)
x = Flatten()(input_x)
x = Dense(200, activation='relu', kernel_initializer='he_uniform')(x)
# and a fully connected layer
predictions = Dense(NUM_CLASSES, activation='softmax', kernel_initializer='glorot_uniform')(x)
model = Model(inputs=input_x, outputs=predictions)
model.summary()
Epoch 3/3
103/103 [==============================] - 65s 628ms/step - loss: 0.2160 - accuracy: 0.9218 - val_loss
104/104 [==============================] - 7s 58ms/step
Transfer Learning Model: F1 Score = 0.44345377756921633, AUC-ROC = 0.5
Transfer Learning Model Performance:
Mean F1 Score: 0.3212, Standard Deviation: 0.1755
=================================================================
input_9 (InputLayer) [(None, 112, 112, 3)] 0
flatten_8 (Flatten) (None, 37632) 0
dense_22 (Dense) (None, 200) 7526600
dense_23 (Dense) (None, 2) 402
=================================================================
Total params: 7,527,002
Trainable params: 7,527,002
Non-trainable params: 0
_________________________________________________________________
Model: "model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_2 (InputLayer) [(None, 1280)] 0
flatten_3 (Flatten) (None, 1280) 0
=================================================================
Total params: 256,602
Trainable params: 256,602
Non-trainable params: 0